METHODS FOR TIME-ALIGNMENT OF LIQUID CHROMATOGRAPHY-MASS
SPECTROMETRY DATA

FIELD OF THE INVENTION 

[0001] The present invention relates generally to analysis of data collected by
analytical techniques such as chromatography and spectrometry. More particularly, it
relates to methods for time-aligning multi-dimensional chromatograms of different
samples to enable automated comparison among sample data.

BACKGROUND OF THE INVENTION

[0002] The high sensitivity and resolution of liquid chromatography-mass
spectrometry (LC-MS) make it an ideal tool for comprehensive analysis of complex
biological samples. Comparing spectra obtained from samples corresponding to
different patient cohorts (e.g., diseased versus non-diseased, or drug responders versus
non-responders) or subjected to different stimuli (e.g., drug administration regimens)
can yield valuable information about sample components correlated with particular
conditions. Such components may serve as biological markers that enable earlier and
more precise diagnosis, patient stratification, or prediction of clinical outcomes. They
may also guide the discovery of suitable and novel drug targets. Because this
approach extracts a large amount of information from a very small sample size,
automated data collection and analysis methods are desirable.

[0003] LC-MS data are reported as intensity or abundance of ions of varying
mass-to-charge ratio (m/z) at varying chromatographic retention times. A
two-dimensional spectrum of LC-MS data from a single sample is shown in FIG. 1, in
which the darkness of points corresponds to signal intensity. A horizontal slice of the
spectrum yields a mass chromatogram, the abundance of ions in a particular m/z range
as a function of retention time. A vertical slice is a mass spectrum, a plot of
abundance of ions of varying m/z at a particular retention time interval. The
two-dimensional data are acquired by performing a mass scan at regular intervals of
retention time. Summing the mass spectrum at each retention time yields a total ion
chromatogram (TIC), the abundance of all ions as a function of retention time. Local
maxima in intensity (with respect to both retention time and m/z) are referred to as
peaks. In general, peaks may span several retention time scan intervals and m/z
values.
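By way of illustration only, the relationship among a mass chromatogram, a mass
spectrum, and the TIC can be sketched in a few lines of Python. The toy intensity
values, the row/column layout (rows as m/z bins, columns as scans), and the function
names are assumptions made for this sketch and do not come from the described
instrument or software.

```python
# Hypothetical toy data set: rows are m/z bins, columns are retention-time scans.
intensity = [
    [0.0, 2.0, 5.0, 1.0],  # m/z bin 0
    [1.0, 0.0, 3.0, 0.0],  # m/z bin 1
    [0.0, 4.0, 1.0, 2.0],  # m/z bin 2
]


def mass_chromatogram(data, mz_index):
    """Horizontal slice: intensity versus retention time for one m/z bin."""
    return list(data[mz_index])


def mass_spectrum(data, scan_index):
    """Vertical slice: intensity versus m/z for one scan."""
    return [row[scan_index] for row in data]


def total_ion_chromatogram(data):
    """Sum each mass spectrum over m/z, giving one value per scan."""
    n_scans = len(data[0])
    return [sum(row[s] for row in data) for s in range(n_scans)]


print(total_ion_chromatogram(intensity))
```

The TIC collapses the m/z dimension, which is why densely spaced peaks at different
m/z values can no longer be distinguished in it.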



[0004] One significant obstacle for automated analysis of LC-MS data is the
nonlinear variability of chromatographic retention times, which can exceed the width
of peaks along the retention time axis substantially. This variability arises from, for
example, changes in column chemistry over time, instrument drift, interactions among
sample components, protein modifications, and minor changes in mobile phase
composition. While constant time offsets can be corrected for easily, nonlinear
variations are more problematic and significantly hamper the recognition of
corresponding peaks across sample spectra. This problem is illustrated by the
chromatograms of FIG. 2, in which the dotted and solid curves represent total ion
chromatograms of samples from two different patients. While it can be assumed that
the dotted curve has been time-shifted from the solid curve, it is difficult to predict
from the two curves to which of the two solid peaks the dotted peak corresponds.

[0005] Various methods have been provided in the art for addressing the problem
of chromatographic retention time shifts, including correlation, curve fitting, and
dynamic programming methods such as dynamic time warping and correlation
optimized warping. For example, a time warping algorithm is applied to gas
chromatography/Fourier transform infrared (FT-IR)/mass spectrometry data from a
gasoline sample in C.P. Wang and T.L. Isenhour, "Time-warping algorithm applied to
chromatographic peak matching gas chromatography/Fourier transform infrared/mass
spectrometry," Anal. Chem. 59: 649-654, 1987. In this method, a single FT-IR
interferogram is aligned with a TIC. While this method may be effective for simple
samples, it may be inadequate for more complex samples such as biological fluids,
which can contain thousands of different proteins and peptides, yielding thousands of
potentially relevant and, more importantly, densely spaced (in both m/z and retention
time) peaks.



[0006] There is still a need, therefore, for a robust method for time-aligning
chromatographic-mass spectrometric data.



WO 03/095978    PCT/US03/14729

BRIEF DESCRIPTION OF THE FIGURES

[0007] FIG. 1 (prior art) shows a sample two-dimensional liquid
chromatography-mass spectrometry (LC-MS) data set.

[0008] FIG. 2 is a schematic diagram of portions of total ion chromatograms of
two different samples, illustrating the difficulties in properly time-aligning
spectra.

[0009] FIG. 3 is a flow diagram of one embodiment of the present invention, a
method for comparing samples.

[0010] FIGS. 4A-4B illustrate aspects of a dynamic time warping (DTW) method
according to one embodiment of the present invention.

[0011] FIG. 5 shows a grid of chromatographic time points used in DTW, with
an optimal route through the grid indicated.

[0012] FIGS. 6A-6B illustrate two constraints on a DTW method according to
one embodiment of the present invention.

[0013] FIGS. 7A-7C illustrate aspects of a locally-weighted regression smoothing
method according to one embodiment of the present invention.

[0014] FIGS. 8A-8B show corresponding peaks of one reference and three test
LC-MS data sets before and after time-alignment by DTW.

[0015] FIG. 9 is a plot showing results of alignment of LC-MS data sets by robust
LOESS and DTW.

DETAILED DESCRIPTION OF THE INVENTION

[0016] Various embodiments of the present invention provide methods for
time-aligning two-dimensional chromatography-mass spectrometry data sets, such as
liquid chromatography-mass spectrometry (LC-MS) data sets, also referred to as spectra.
These data sets can have nonlinear variations in retention time, so that corresponding
peaks (i.e., peaks representing the same analyte) in different samples elute from the
chromatographic column at different times. Additional embodiments provide
methods for comparing samples and data sets, methods for identifying biological
markers (biomarkers), aligned spectra produced according to these methods, samples
compared according to these methods, biomarkers identified according to these
methods, and methods for using the identified biomarkers for diagnostic and
therapeutic applications.

[0017] The methods are effective at aligning two-dimensional data sets obtained
from both simple and complex samples. Although complex and simple are relative
terms and are not intended to limit the scope of the present invention in any way,
complex samples typically have many more and more densely spaced spectral peaks
than do simple samples. For example, complex samples such as biological samples
may have upwards of hundreds or thousands of peaks in sixty minutes of retention
time, such that the total ion chromatogram (TIC) is too complex to allow resolution of
individual features. Rather than use composite one-dimensional data such as the TIC,
the methods in embodiments of the present invention use data from individual mass
chromatograms, i.e., data representing abundances or intensities of ions in particular
m/z ranges at particular retention times. The m/z range included within a single mass
chromatogram may reflect the preprocessing (e.g., binning) of the raw data, and is
typically on the order of between about 0.1 and 1.0 atomic mass unit (amu). Mass
scans typically occur at intervals of between about one and about three seconds.

[0018] In some embodiments of the present invention, computations are referred
to as being performed "in dependence on at least two mass chromatograms from each
data set." This phrase is to be understood as referring to computations on individual
data from a mass chromatogram, rather than to data summed over a number of
chromatograms.

[0019] While embodiments of the invention are described below with reference to
chromatography and mass spectrometry, and particularly to liquid chromatography, it
will be apparent to one of skill in the art how to apply the methods to any other
hyphenated chromatographic technique. For example, the second dimension may be
any type of electromagnetic spectroscopy such as microwave, far infrared, infrared,
Raman or resonance Raman, visible, ultraviolet, far ultraviolet, vacuum ultraviolet,
x-ray, or ultraviolet fluorescence or phosphorescence; any magnetic resonance
spectroscopy, such as nuclear magnetic resonance (NMR) or electron paramagnetic
resonance (EPR); and any type of mass spectrometry, including ionization methods
such as electron impact, chemical, thermospray, electrospray, matrix assisted laser
desorption, and inductively coupled plasma ionization, and any detection methods,
including sector, quadrupole, ion trap, time of flight, and Fourier transform detection.



[0020] Time-alignment methods are applied to data sets acquired by performing
chromatographic and spectrometric or spectroscopic methods on chemical or
biological samples. The samples can be in any homogeneous or heterogeneous form
that is compatible with the chromatographic instrument, for example, one or more of a
gas, liquid, solid, gel, or liquid crystal. Biological samples that can be analyzed by
embodiments of the present invention include, without limitation, whole organisms;
parts of organisms (e.g., tissue samples); tissue homogenates, extracts, infusions,
suspensions, excretions, secretions, or emissions; administered and recovered
material; and culture supernatants. Examples of biological fluids include, without
limitation, whole blood, blood plasma, blood serum, urine, bile, cerebrospinal fluid,
milk, saliva, mucus, sweat, gastric juice, pancreatic juice, seminal fluid, prostatic
fluid, sputum, bronchoalveolar lavage, and synovial fluid, and any cell suspensions,
extracts, or concentrates of these fluids. Non-biological samples include air, water,
liquids from manufacturing wastes or processes, foods, and the like. Samples may be
correlated with particular subjects, cohorts, conditions, time points, or any other
suitable descriptor or category.

[0021] FIG. 3 is a flow diagram of a general method 20 according to one
embodiment of the present invention. The method is typically implemented in
software by a computer system in communication with an analytical instrument such
as a liquid chromatography-mass spectrometry (LC-MS) instrument. In a first step
22, raw data sets are obtained, e.g., from the instrument, from a different computer
system, or from a data storage device. The data sets, which are also referred to as
spectra or two-dimensional data sets or spectra, contain intensity values for discrete
values (or ranges of values) of chromatographic retention time (or scan index) and
mass-to-charge ratio (m/z). At each scan time of the instrument, an entire mass
spectrum is obtained, and the collection of mass spectra for the chromatographic run
of that sample makes up the data set. Typically, a collection of data sets is acquired
from a large number (i.e., more than two) of samples before subsequent processing
occurs.

[0022] In an optional next step 24, the data sets are preprocessed using
conventional algorithms. Examples of preprocessing techniques applied include,
without limitation, baseline subtraction, smoothing, noise reduction, de-isotoping,
normalization, and peak list creation. Additionally, the data can be binned into
defined m/z intervals to create mass chromatograms. Data are collected at discrete
scan times, but m/z values in the mass spectra are typically of very high mass
precision. In order to create mass chromatograms, data falling within a specified m/z
interval (e.g., 0.5 amu) are combined into a composite value for that interval. Any
suitable binning algorithm may be employed; as is known in the art, the selection of a
binning algorithm and its parameters may have implications for data smoothness,
fidelity, and quality.
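As a minimal sketch of one possible binning scheme, and not necessarily the one
contemplated here, the following Python code sums the intensities of each scan into
fixed-width m/z bins and then transposes the result into mass chromatograms. The
function names, the choice of summation as the composite value, and the example m/z
values are assumptions made for illustration.

```python
def bin_scan(mz_values, intensities, mz_min, bin_width, n_bins):
    """Sum the intensities of one mass scan into fixed-width m/z bins."""
    binned = [0.0] * n_bins
    for mz, inten in zip(mz_values, intensities):
        index = int((mz - mz_min) // bin_width)
        if 0 <= index < n_bins:  # ignore ions outside the binned m/z range
            binned[index] += inten
    return binned


def mass_chromatograms(scans, mz_min, bin_width, n_bins):
    """Bin every scan, then transpose so that each row is one mass chromatogram.

    `scans` is a list of (mz_values, intensities) pairs, one pair per scan time.
    """
    binned_scans = [bin_scan(mz, inten, mz_min, bin_width, n_bins)
                    for mz, inten in scans]
    return [[scan[b] for scan in binned_scans] for b in range(n_bins)]


# Two hypothetical scans binned into three 0.5 amu intervals starting at m/z 400.
scans = [([400.2, 400.7], [1.0, 2.0]), ([400.3, 401.1], [3.0, 4.0])]
print(mass_chromatograms(scans, 400.0, 0.5, 3))
```

Summation is only one way to form the composite value; a maximum or a weighted
average could be substituted without changing the overall structure.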

[0023] In step 26, a time-aligning algorithm is applied to one or more pairs of data
sets. One data set can be chosen (arbitrarily or according to a criterion) to serve as a
reference spectrum and all other data sets time-aligned to this spectrum. For example,
assuming the samples are analyzed on the instrument consecutively, the reference data
set can correspond to the sample analyzed in the middle of the process. Alternatively,
a feedback method can be implemented in which the degree of time shift is measured
for each data set, potentially with respect to one or more of the data sets chosen
arbitrarily as a reference data set, and the one with a median time shift, according to
some metric, selected as the reference data set. Data sets can also be evaluated by a
perceived or actual quality metric to determine which to select as the reference data
set.

[0024] After the data sets are aligned to a common retention time scale, the
aligned data sets can be compared automatically in step 28 to locate features that
differentiate the spectra. For example, a peak that occurs in only certain spectra or at
significantly different intensity levels in different spectra may represent a biological
marker or a component of a biological marker that is indicative of or diagnostic for a
characteristic of the relevant samples (e.g., disease, response to therapy, patient group,
disease progression). If desired, the identity of the ions responsible for the
distinguishing features can be identified. Biological markers may also be more
complex combinations of spectral features or sample components with or without
other clinical or biological factors. Identifying spectral differences and biological
markers is a multi-step process and will not be described in detail herein. For more
information, see U.S. Patent Application No. 09/994,576, "Methods for Efficiently
Mining Broad Data Sets for Biological Markers," filed 11/27/2001, which is
incorporated herein by reference. In general, this step 28 is referred to as differential
phenotyping, because differences among phenotypes, as represented by the
comprehensive (rather than selective) LC-MS spectrum of expressed proteins and
small molecules, are detected.

[0025] Step 26, time-aligning pairs of spectra, can be implemented in many
different ways. In one embodiment of the invention, spectra are aligned using a
variation of a dynamic time warping (DTW) method. DTW is a dynamic
programming technique that was developed in the field of speech recognition for
time-aligning speech patterns and is described in H. Sakoe and S. Chiba, "Dynamic
programming algorithm optimization for spoken word recognition," IEEE Trans.
Acoust., Speech, Signal Process. ASSP-26: 43-49, 1978, which is incorporated herein
by reference.

[0026] In embodiments of the present invention, DTW aligns two data sets by
nonlinearly stretching and contracting ("warping") the time component of the data
sets to synchronize spectral features and yield a minimum distance between the two
spectra. In asymmetric DTW, a test data set is warped to align with a reference data
set. Alternatively, in symmetric DTW, both data sets are adjusted to fit a common
time index. The following description is of asymmetric warping, but it will be
apparent to one of ordinary skill in the art, upon reading this description, how to



[0027] FIG. 4A is a plot of two chromatograms, labeled test and reference, whose
time scales are nonlinearly related. That is, peaks representing the same analyte,
referred to as corresponding peaks (and the corresponding points that make up these
peaks), occur at different retention times, and there is no linear transformation of time
components that will map corresponding peaks to the same retention times. Although
the data are shown as continuous curves, each data set consists of discrete values (an
entire mass spectrum) at a sequence of time indices; for clarity, only a single intensity
value, rather than an entire mass spectrum, is shown at each time point. In the figure,
corresponding points are connected by dashed lines, which represent a mapping of
time points in the reference data set to time points in the test data set. This mapping is
shown more explicitly in the table of FIG. 4B. The object of a DTW algorithm is to
identify this time point mapping, from which an aligned data set may be
constructed. Note that DTW aligns the entire data set, and not just peaks of the data
set, and that DTW yields a discrete time point mapping rather than a function that
transforms the original time points into aligned time points. As a result, some points
(reference and test) do not get mapped, and unmapped points can be handled as
described below.

[0028] Conceptually, the DTW method considers a set of possible time point
mappings and identifies the mapping that minimizes an accumulated distance function
between the reference and test data sets. Consider the grid in FIG. 5, in which rows
correspond to $I$ time indices $i$ in the test data set and columns to $J$ time indices $j$ in the
reference data set ($I$ and $J$ can be different). Each possible time point mapping can be
represented as a route $c(k)$ through this grid, where $c(k) = [i(k), j(k)]$ and $1 \le k \le K$.
For example, if the test and reference data sets were perfectly aligned, the route would
be a diagonal beginning in the upper left cell and proceeding to the lower right cell of
the grid. The selected route represents the optimal time point mapping.

[0029] The set of possible routes is limited by three types of constraints: endpoint
constraints; a local continuity constraint, which defines local features of the path; and
a global constraint, which defines the allowable search space for the path. The
endpoint constraint equates the first and last time point in each data set. In the grid,
the upper left and lower right cells are fixed as the start and end of the path,
respectively, i.e., $c(1) = [1, 1]$ and $c(K) = [I, J]$. The local continuity constraint forces
the path to be monotonic with a non-negative slope, meaning that, for a path $c(k) =
[i(k), j(k)]$, $i(k+1) \ge i(k)$ and $j(k+1) \ge j(k)$. This condition maintains the order of time
points. An upper bound can also be placed on the slope to prevent excessive
compression or expansion of time scales. The result of these conditions is that the
path to an individual cell is limited to one of the three illustrated in FIG. 6A. Finally,
the global constraint limits the path to a specified number of grid places from the
diagonal, illustrated schematically in FIG. 6B. This latter constraint confines the
solution to one that is physically realizable while also substantially limiting the
computation time.

[0030] The optimal path through the grid is one that minimizes the accumulated
distance function between the test and reference data sets over the route. Each cell
$[i, j]$ has an associated distance function between data sets at the particular $i$ and $j$ time
indices. The distance function can take a variety of different forms. If only a single
chromatogram (e.g., the TIC) were considered, the distance function $d_{i,j}$ between
points $t_i^{test}$ and $t_j^{ref}$ would be:

[Equation (1)]

where $I_j^{ref}$ is the $j$-th intensity value of the reference spectrum and $I_i^{test}$ is the $i$-th intensity
value of the test spectrum. In embodiments of the present invention, however, $M$
mass chromatograms of each data set are considered in computing the distance
function, where $M \ge 2$, and so, in one embodiment, the distance function is

[Equation (2)]

where $I_{kj}^{ref}$ is the $j$-th intensity value of the $k$-th reference chromatogram and $I_{ki}^{test}$ is the
$i$-th intensity value of the $k$-th test chromatogram. Both chromatograms are for a
single m/z range. Each cell of the grid in FIG. 5 is filled with the appropriate value of
the distance function, and a route is chosen through the matrix that minimizes the
accumulated distance function obtained by summing the values in each cell traversed,
subject to the above-described constraints. Note that the two terms distance and route
are not related; the distance refers to a metric of the dissimilarity between data sets,
while the route refers to a path through the grid and has no relevant distance.
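To make the grid-filling step concrete, the following Python sketch computes a
distance value for every cell from M mass chromatograms per data set. The sum of
squared intensity differences used here is an assumption chosen for illustration;
as noted above, the distance function can take a variety of forms. The function name
and the list-of-lists layout are likewise hypothetical.

```python
def distance_matrix(test, reference):
    """Return d[i][j], the distance between test index i and reference index j.

    test[k][i] and reference[k][j] hold the intensities of the k-th selected mass
    chromatogram; both data sets must use the same M >= 2 m/z ranges.
    ASSUMPTION: squared intensity differences summed over the M chromatograms.
    """
    if len(test) != len(reference) or len(test) < 2:
        raise ValueError("need the same M >= 2 chromatograms in both data sets")
    n_test, n_ref = len(test[0]), len(reference[0])
    return [
        [
            sum((ref_chrom[j] - test_chrom[i]) ** 2
                for test_chrom, ref_chrom in zip(test, reference))
            for j in range(n_ref)
        ]
        for i in range(n_test)
    ]


# Two chromatograms (M = 2), two time points each; identical data sets.
example = [[0.0, 1.0], [2.0, 0.0]]
print(distance_matrix(example, example))
```

Rows of the returned matrix correspond to test time indices and columns to reference
time indices, matching the grid orientation described for FIG. 5.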

[0031] The route-finding problem can be addressed using a dynamic
programming approach, in which the larger optimization problem is reduced to a
series of local problems. At each allowable cell in the grid (FIG. 6B), the optimal
one of the three (FIG. 6A) single-step paths is identified. After all cells have been
considered, a globally optimal route is reconstructed by stepping backwards through
the grid from the last cell. For more information on dynamic programming, see
Cormen et al., Introduction to Algorithms (2nd ed.), Cambridge: MIT Press (2001),
which is incorporated herein by reference.

[0032] Locally optimal paths are selected by minimizing the accumulated distance
from the initial cell to the current cell. For the three potential single-step paths to the
cell $[i, j]$, the accumulated distances are

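$$
\begin{aligned}
D_{i,j}^{(1)} &= D_{i-2,j-1} + 2d_{i-1,j} + d_{i,j} \\
D_{i,j}^{(2)} &= D_{i-1,j-1} + 2d_{i,j} \\
D_{i,j}^{(3)} &= D_{i-1,j-2} + 2d_{i,j-1} + d_{i,j}
\end{aligned}
\qquad (3)
$$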

where $D_{i,j}^{(p)}$ represents the accumulated distance from $[1, 1]$ to $[i, j]$ when path $p$ is
traversed, $d_{i,j}$ is computed from equation (2), and $D_{i-2,j-1}$, $D_{i-1,j-1}$, and $D_{i-1,j-2}$ are
evaluated in previous steps. The coefficient 2 is a weighting factor that inclines the
path to follow the diagonal. It may take on other values as desired. The minimized
accumulated distance for the cell $[i, j]$ is given by:

$$D_{i,j} = \min_{p} D_{i,j}^{(p)} \qquad (4)$$

This value is stored in an accumulated distance matrix for use in subsequent
calculations, and the selected value of $p$ is stored in an index matrix.

[0033] The dynamic programming algorithm proceeds by stepping through each
cell and finding and storing the minimum accumulated distances and optimal indices.
Typically the process begins at the top left cell of the grid and moves down through
all allowed cells before moving to the next column, with the allowable cells in each
column defined by the global search space. After the final cell has been computed,
the optimal route is found by traversing the grid backwards to the starting cell [1, 1]
based on optimal paths stored in the index matrix. Note that the route cannot be
constructed in the forward direction, because it is not known until subsequent
calculations whether the current cell will lie on the optimal route. Once the optimal
route has been determined, an aligned test data set can be constructed.
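The forward pass and backward traversal described above can be sketched as follows.
This is an illustrative Python sketch under stated assumptions, not a definitive
implementation: it uses a three-path recursion of the Sakoe-Chiba type with a weight
of 2, initializes the first cell to twice its distance, treats the global constraint
as a fixed band of `band` cells around the diagonal, and records only the end cells
of each single-step path in the route. The function name `dtw_route` and zero-based
indexing are also choices made for this sketch.

```python
import math


def dtw_route(d, band=None):
    """Find a minimum accumulated-distance route through the distance grid d.

    d[i][j] is the distance between test index i and reference index j.
    Returns (accumulated distance, route), where route is a list of (i, j) cells.
    """
    n_test, n_ref = len(d), len(d[0])
    acc = [[math.inf] * n_ref for _ in range(n_test)]   # accumulated distances
    step = [[None] * n_ref for _ in range(n_test)]      # index matrix (path p)
    acc[0][0] = 2.0 * d[0][0]                           # ASSUMED initialization

    for j in range(n_ref):              # column by column ...
        for i in range(n_test):         # ... moving down each column
            if (i, j) == (0, 0):
                continue
            if band is not None and abs(i - j) > band:
                continue                # outside the global search space
            candidates = []
            if i >= 2 and j >= 1:
                candidates.append((acc[i - 2][j - 1] + 2 * d[i - 1][j] + d[i][j], 1))
            if i >= 1 and j >= 1:
                candidates.append((acc[i - 1][j - 1] + 2 * d[i][j], 2))
            if i >= 1 and j >= 2:
                candidates.append((acc[i - 1][j - 2] + 2 * d[i][j - 1] + d[i][j], 3))
            if candidates:
                cost, p = min(candidates)
                if cost < math.inf:
                    acc[i][j], step[i][j] = cost, p

    if acc[-1][-1] == math.inf:
        raise ValueError("end cell is not reachable under the constraints")

    # Backward traversal from the last cell using the stored path indices.
    back = {1: (2, 1), 2: (1, 1), 3: (1, 2)}
    route, i, j = [], n_test - 1, n_ref - 1
    while (i, j) != (0, 0):
        route.append((i, j))
        di, dj = back[step[i][j]]
        i, j = i - di, j - dj
    route.append((0, 0))
    route.reverse()
    return acc[-1][-1], route


# Identical test and reference chromatograms should yield the diagonal route.
values = [0.0, 1.0, 4.0, 9.0, 16.0]
grid = [[(r - t) ** 2 for r in values] for t in values]
print(dtw_route(grid, band=2))
```

Storing the chosen path index for every cell is what makes the backward traversal
possible; the forward pass alone cannot tell whether a given cell lies on the final
route.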

[0034] Unless the test and reference data sets are perfectly aligned, there are
points in both sets that do not get mapped. When the test time scale is compressed,
some intermediate test points do not get mapped. These points are discarded. When
the test time scale is expanded, there are reference time points for which no
corresponding test point exists. Values of the points can be estimated, e.g., by
linearly interpolating between intensity values of surrounding points that have been
mapped to reference points.
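As a hedged illustration of this step, the Python sketch below builds an aligned test
chromatogram on the reference time axis from a route of (test index, reference index)
pairs. Unmapped test points are simply never copied, and reference indices without a
mapped test point are filled by linear interpolation between the nearest mapped
neighbors. The function name and the example route are hypothetical, and the sketch
assumes the route includes the first and last reference indices, as the endpoint
constraint requires.

```python
def align_test_to_reference(test_values, route, n_ref):
    """Place test intensities on the reference time axis and fill the gaps."""
    aligned = [None] * n_ref
    for i, j in route:
        aligned[j] = test_values[i]          # mapped points are copied directly

    mapped = [j for j in range(n_ref) if aligned[j] is not None]
    for left, right in zip(mapped, mapped[1:]):
        for j in range(left + 1, right):     # reference points with no test point
            frac = (j - left) / (right - left)
            aligned[j] = aligned[left] + frac * (aligned[right] - aligned[left])
    return aligned


# Test index 2 is skipped (compression); reference index 1 is interpolated.
print(align_test_to_reference([0.0, 10.0, 20.0, 30.0], [(0, 0), (1, 2), (3, 3)], 4))
```

The same routine would be applied to every mass chromatogram of the test data set,
since one route describes the time mapping for the whole data set.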

[0035] The above-described methods and steps can be varied in many ways
without departing from the scope of the invention. For example, alternative
constraints can be applied to the route (e.g., different allowable local slopes, end
points not fixed but rather constrained to allowable regions, different global search
space), and alternative distance functions can be employed. The weighting factors for
local paths can be varied from the value 2 used in equations (3). Additionally, a
normalization factor can be included in the distance function. The distance function
above is based on intensity, but, depending on how the data set is represented, can be
based on any other coefficient of features of the data set. For example, the function
can be computed from coefficients of wavelets, peaks, or derivatives by which the
data set is represented. In this case, the distance is a measure of the degree of
alignment of these features.

[0036] In the equations above, the distance function is computed based on data
from M individual mass chromatograms. Any value of M is within the scope of the
present invention, as are any selection criteria by which chromatograms are selected
for inclusion. Reducing the number of chromatograms from the total number in the
data set (e.g., 2000) to M can decrease the computation time substantially.
Additionally, excluding noisy chromatograms or those without peaks can improve the
alignment accuracy. There is generally an optimal range of M that balances alignment
accuracy and computation time, and it is beneficial to choose a value of M in the
lower end of the range, i.e., a value that minimizes computation time without
sacrificing substantially the accuracy of time-alignment. It is also beneficial to
include chromatograms containing peaks throughout the range of retention time; this
is particularly important near the beginning and end of the chromatographic run, when
there are fewer peaks. In one embodiment, between about 200 and about 400
chromatograms are used. Alternatively, between about 200 and about 300
chromatograms are used. In another embodiment, M is about 200.

[0037] A variety of selection criteria can be applied individually or jointly to
select the chromatograms with which the distance function is computed. The
selection criteria or their parameters (e.g., intensity thresholds) can be predetermined,
computed at run time, or selected by a user. M can be a selected value (manually or
automatically) or the result of applying the criterion or criteria (i.e., M chromatograms
happen to fit the criteria).

[0038] One selection criterion is that a mass chromatogram have peaks in both the
reference and test data sets, as determined by a manual or automated peak selection
algorithm. Peak selection algorithms typically apply an intensity threshold and
identify local maxima exceeding the threshold as peaks. The peaks may or may not
be required to be corresponding (in m/z and retention time) for the chromatogram to
meet the criterion. If corresponding peaks are required, a relatively large window in
retention time is applied to account for the to-be-corrected retention time shifts.

[0039] Another selection criterion is that maximum, median, or average intensity
values in a mass chromatogram exceed a specified intensity threshold, or that a single
peak intensity or maximum, median, or average peak intensity values in the
chromatogram exceed an intensity threshold. Alternatively, at least one individual
peak intensity or the maximum, median, or average peak intensity can be required to
fall between upper and lower intensity level thresholds. Another selection criterion is
that the number of peaks in a mass chromatogram exceed a threshold value. These
criteria are typically applicable to both the reference and test mass chromatograms.






[0040] When the selection criterion involves an intensity threshold, the threshold
can be constant or vary with retention time to accommodate variations in mean or
median signal intensity throughout a chromatographic run. Often, the beginning and
end of the run yield fewer and lower intensity peaks than occur in the middle of the
run, and lower thresholds may be suitable for these regions.

[0041] According to an alternative selection criterion, a set of the most orthogonal
chromatograms is selected, i.e., the set that provides the most information. If an
analyte is present in chromatograms of adjacent m/z values, these chromatograms
may be redundant, providing no more information than is provided by a single
chromatogram. Standard correlation methods can be applied to select orthogonal
chromatograms. The orthogonal chromatograms are selected to span the elution time
range, so that just enough information is provided to align the data accurately
throughout the entire range. In this case, the selection criterion contains an
orthogonality metric and a retention time range.

[0042] Individual selection criteria may be combined in many different ways. For
example, in one composite selection criterion, peaks are first selected in the reference
and test data sets using any suitable manual or automatic peak selection method.
Next, a filter is applied separately to the two data sets to yield two subsets of peaks.
This filter can be a single threshold or two (upper and lower) thresholds. A lower
threshold ensures that peaks are above the noise level, while an upper threshold
excludes falsely elevated values reflecting a saturated instrument detector.
Corresponding peaks are then selected that appear in both the test and reference peak
subsets. Chromatograms corresponding to these peaks are included in computing the
distance function. Alternatively, from the list of corresponding peaks, M
chromatograms are chosen randomly. For example, if N corresponding peaks are
found, the chromatograms corresponding to every (N/M)-th m/z value are selected.
Alternatively, the M chromatograms can be selected from the corresponding peaks
based on an intensity threshold or some other criterion.
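A simplified sketch of such a composite criterion is given below in Python. It keeps
a chromatogram only if both the test and reference versions contain a local maximum
between a lower and an upper intensity threshold, and then thins the eligible list to
at most M entries at an even stride. The peak test (a three-point local maximum), the
omission of any retention-time correspondence check, and all names are assumptions
made for brevity; a practical implementation would use a proper peak selection
method.

```python
def has_peak(chromatogram, lower, upper):
    """True if some local maximum lies between the lower and upper thresholds."""
    for a, b, c in zip(chromatogram, chromatogram[1:], chromatogram[2:]):
        if b > a and b >= c and lower <= b <= upper:
            return True
    return False


def select_chromatograms(test, reference, lower, upper, m):
    """Return indices of at most m chromatograms with peaks in both data sets."""
    eligible = [
        k for k in range(len(reference))
        if has_peak(test[k], lower, upper) and has_peak(reference[k], lower, upper)
    ]
    if len(eligible) <= m:
        return eligible
    stride = len(eligible) / m               # keep roughly every (N/M)-th entry
    return [eligible[int(n * stride)] for n in range(m)]


# Chromatogram 1 is below the noise threshold; chromatogram 4 is saturated.
data = [[0, 5, 0], [0, 1, 0], [0, 6, 0], [0, 7, 0], [0, 200, 0]]
print(select_chromatograms(data, data, lower=2, upper=100, m=2))
```

Because the selection is made per pair of data sets, a different list of indices may
result for each test data set aligned to the same reference.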

[0043] When more than one test data set is aligned to the reference data set, each
pairwise alignment can be computed based on a different set of
independently-selected chromatograms.

[0044] In one embodiment of the invention, a weighting factor $W_k$ is included in
the distance function, causing different chromatograms to contribute unequally. As a
result, certain chromatograms tend to dominate the sum and dictate the alignment.
The weighted distance function is:

[Equation (5)]

where $W_k$ is the chromatogram-dependent weighting factor. The functional form or
value of the weighting factor can be determined a priori based on user knowledge of
the most relevant mass ranges. Alternatively, the weighting factor can be computed
based on characteristics of the data. For example, the weighting factor can be a
function of one or more of the following variables: the number of peaks per
chromatogram (peak number), selected by any manual or automatic method; the
signal-to-noise ratio in a chromatogram; and peak threshold or intensities.
Chromatograms having more peaks, higher signal-to-noise ratio, or higher peak
intensities are typically weighted more than other chromatograms. Any additional
variables can be included in the weighting factor. The factor can also depend on a
combination of user knowledge and data values.
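One hypothetical way to derive and apply such weights is sketched below in Python.
The weights here are proportional to the number of peaks in each reference
chromatogram, and the weighted distance is taken as a weighted sum of squared
intensity differences. Both choices, along with the function names and the
three-point peak test, are assumptions for illustration and are not asserted to be
the functional form of the weighting described above.

```python
def count_peaks(chromatogram, threshold):
    """Count three-point local maxima whose intensity exceeds the threshold."""
    return sum(
        1
        for a, b, c in zip(chromatogram, chromatogram[1:], chromatogram[2:])
        if b > a and b >= c and b > threshold
    )


def peak_count_weights(chromatograms, threshold):
    """ASSUMED weighting: each weight is that chromatogram's share of all peaks."""
    counts = [count_peaks(c, threshold) for c in chromatograms]
    total = sum(counts)
    if total == 0:
        return [1.0 / len(chromatograms)] * len(chromatograms)
    return [n / total for n in counts]


def weighted_distance(test, reference, weights, i, j):
    """ASSUMED form: weighted sum of squared differences at cell [i, j]."""
    return sum(
        w * (ref_chrom[j] - test_chrom[i]) ** 2
        for w, test_chrom, ref_chrom in zip(weights, test, reference)
    )


reference = [[0, 5, 0, 5, 0], [0, 5, 0, 0, 0]]
print(peak_count_weights(reference, threshold=1))
```

Signal-to-noise ratio or peak intensity could be substituted for the peak count
without changing how the weights enter the distance.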

[0045] In an alternative embodiment of the invention, the time-aligning step 26
employs locally-weighted regression smoothing. Rather than act on the raw (or
preprocessed) data, this method time-aligns selected peaks in test and reference data
sets. Peaks, defined by m/z and retention time values, are first selected from each
data set by manual or automatic means. Potentially corresponding peaks are
identified from the lists as peaks that fall within a specified range of m/z and retention
time values. FIG. 7A shows an excerpt of a reference peak list and test peak list with
potentially corresponding peaks shaded. These peaks are plotted on FIG. 7B, which
shows the window surrounding the reference peak that defines a region of potentially
corresponding test peaks. Because the nonlinear time variations have not yet been
corrected, the window has a relatively large retention time range, accounting for the
maximum retention time variation throughout the chromatographic run (e.g., five
minutes).
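
The window test described above can be sketched, under stated assumptions, as follows. Peaks are represented here as `(m/z, retention time)` tuples, and the default m/z tolerance of 0.5 is a hypothetical value; only the five-minute retention time window is taken from the example in the text.

```python
def candidate_pairs(ref_peaks, test_peaks, mz_tol=0.5, rt_tol=5.0):
    """Return (t_ref, t_test) for every test peak that falls inside the
    m/z and retention-time window around a reference peak.

    ref_peaks, test_peaks: lists of (mz, retention_time) tuples (assumed format).
    mz_tol: half-width of the m/z window (hypothetical default).
    rt_tol: half-width of the retention-time window, in minutes.
    """
    pairs = []
    for mz_ref, t_ref in ref_peaks:
        for mz_test, t_test in test_peaks:
            if abs(mz_ref - mz_test) <= mz_tol and abs(t_ref - t_test) <= rt_tol:
                pairs.append((t_ref, t_test))
    return pairs


if __name__ == "__main__":
    ref = [(500.2, 30.0), (610.7, 45.0)]
    test = [(500.3, 32.5), (500.1, 40.0), (610.6, 44.0)]
    print(candidate_pairs(ref, test))
```

A reference peak may pair with several test peaks; the smoothing step is what later separates the true correspondences from the coincidental ones.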

[0046] For every pair of reference peak and potentially corresponding test peak,
the data are transformed from $(t_{\mathrm{ref}}, t_{\mathrm{test}})$ to $(t_{\mathrm{avg}}, \Delta t)$, where
$t_{\mathrm{avg}} = (t_{\mathrm{ref}} + t_{\mathrm{test}})/2$ and $\Delta t = t_{\mathrm{ref}} - t_{\mathrm{test}}$. The resulting plot, for
exemplary data sets, is shown in FIG. 7C. It is apparent from FIG. 7C that the points
tend to cluster around a curve that represents the nonlinear time variation between
reference and test data sets. Knowing this curve would enable correction of the time
variation and alignment of the data sets. To do so, a smoothing algorithm is applied to
the transformed variables to yield a set of discrete values $(t_{\mathrm{avg}}, \Delta t)$, which can be
transformed back to $(t_{\mathrm{ref}}, t_{\mathrm{test}})$. Because the smoothing is applied to data points
representing peaks, and because the result is a discrete mapping of points rather than a
function, adjusted time values of data points between the peaks are then computed,
e.g., by interpolation. After all points have been mapped, aligned data sets can be
constructed. Typically, time points of the reference data set are fixed and the test data
set modified. This process can be repeated to align all data sets to the reference data set.
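
The change of variables and its inverse follow directly from the two definitions above: solving them for the original times gives t_ref = t_avg + Δt/2 and t_test = t_avg - Δt/2. A minimal sketch, with hypothetical function names:

```python
def to_avg_delta(t_ref, t_test):
    """Forward transform: (t_ref, t_test) -> (t_avg, delta_t)."""
    return (t_ref + t_test) / 2.0, t_ref - t_test


def from_avg_delta(t_avg, delta_t):
    """Inverse transform: (t_avg, delta_t) -> (t_ref, t_test)."""
    return t_avg + delta_t / 2.0, t_avg - delta_t / 2.0


if __name__ == "__main__":
    t_avg, delta_t = to_avg_delta(30.0, 32.5)
    print(t_avg, delta_t)
    print(from_avg_delta(t_avg, delta_t))
```

Working in these variables places the time shift on the vertical axis, so the smoothing acts on the shift itself rather than on two nearly equal retention times.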

[0047] One suitable smoothing algorithm is a LOESS algorithm (locally weighted
scatterplot smooth), originally proposed in W. S. Cleveland, "Robust locally weighted
regression and smoothing scatterplots," J. Am. Stat. Assoc. 74: 829-836, 1979, and
further developed in W. S. Cleveland and S. J. Devlin, "Locally weighted regression:
an approach to regression analysis by local fitting," J. Am. Stat. Assoc. 83: 596-610,
1988, both of which are incorporated herein by reference. A LOESS function
(sometimes called LOWESS) is available in many commercial mathematics and
statistics software packages, such as S-PLUS®, SAS, Mathematica, and MATLAB®.

[0048] The LOESS method, described in more detail below, fits a polynomial
locally to points in a window centered on a given point to be smoothed. Both the
window size ("span") and polynomial degree must be selected. The span is typically
specified as a percentage of the total number of points. In standard LOESS, a
polynomial is fit to the span by weighting points in the window based on their
distance from the point to be smoothed. After fitting the polynomial, the smoothed
point is replaced by the computed point, and the method proceeds to the next point,
recalculating weights and fitting a new polynomial. Each time, even though the entire
span is fit by the polynomial, only the center point is adjusted. Because the method
operates locally, it is quite effective at representing the fine nonlinear variations in
chromatographic retention time.

[0049] A robust version of LOESS, which is more resistant to outliers, computes
the smoothed points in an iterative fashion by continuing to modify the weights until
convergence (or based on a selected number of iterations). The iterative corrections
are based on the residuals between the polynomial fit and the raw data points. After
the points are fit using initial weights, subsequent weights are computed as the
products of the initial weights and the new weights. Upon convergence, the span is
moved by one point and the entire process repeated. In this manner, the polynomial
regression weights are based on both the distance from the point to be smoothed
(distance in abscissa value) and the distance between the point and the curve fit
(distance in ordinate value), yielding a very robust fit.

[0050] Specific details of the robust LOESS fit are described below. It is to be
understood that any variations in parameters, weighting factors, and polynomial
degree are within the scope of the present invention. Each discrete $(t_{\mathrm{avg}}, \Delta t)$ point is
represented in the formulae below as $(x_i, y_i)$. The approximated value of $y_i$ computed
from the polynomial fit is represented as $\hat{y}_i$.

[0051] First, a window size is chosen and centered on the point to be smoothed, $x$.
Suitable window sizes are between about 10% and about 50% (e.g., about 30%) of the
total span of $x_i$ values. The results may be sensitive to the span, and the optimal span
depends on a number of factors, including the threshold by which peaks are selected.
For example, if the peak selection threshold is low, yielding a large number of densely
located points, the optimal span size may be larger than if the peak selection threshold
were to yield fewer, less dense points. The span can also be selected by performing
the smoothing using a few different spans and selecting the one that yields the best
alignment according to a fit metric, a measure of how well the smoothed values fit the
apparent alignment function or of how much the $\Delta t$ value varies locally or globally
across the retention time range. The smoothing can also be evaluated based on
knowledge of the expected result. The $N$ points within the chosen span are fit to a
weighted polynomial of degree $L$ (typically, $L = 2$) by minimizing the regression merit
function,



$$\chi^2 = \sum_{i=1}^{N} w_i \left( y_i - \sum_{k=0}^{L} a_k x_i^k \right)^2 \tag{5}$$



where $a_k$ are the polynomial coefficients to be solved for and $w_i$ are the regression
weights for each point $x_i$ in the span. Initially, the weights $w_i$ are given by a tricubic
function:



$$w_i^{\mathrm{initial}} = \left( 1 - \left| \frac{x - x_i}{x - x_{\max}} \right|^3 \right)^3 \tag{6}$$

where $x$ is the point being smoothed, $x_i$ are the individual points within the span, and
$x_{\max}$ is the point farthest from $x$. The weights vary smoothly from 0 for the point
farthest from the smoothed point to 1 for the smoothed point. All weights are zero for
points outside the span. The regression merit function in equation (5) is minimized to
determine the polynomial coefficients $a_k$. For standard LOESS, the smoothed value $\hat{y}$
is computed from the polynomial, and the span is moved one point to the right to
smooth the next point.
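
For illustration only, standard LOESS as described above can be sketched in a few lines of NumPy. The function names, the nearest-neighbor construction of the span, and the choice to center each local fit on the point being smoothed are assumptions of this sketch, not details taken from the text. Note that `numpy.polyfit` applies its weights to the unsquared residuals, so the square roots of the tricubic weights are passed in order to minimize a weighted sum of squared residuals.

```python
import numpy as np


def tricube_weights(x0, xw):
    """Tricubic weights: 1 at x0, falling to 0 at the farthest point in the span."""
    d = np.abs(np.asarray(xw, dtype=float) - x0)
    d_max = d.max()
    if d_max == 0:
        return np.ones_like(d)
    return (1.0 - (d / d_max) ** 3) ** 3


def loess(x, y, span=0.3, degree=2):
    """Standard (non-robust) LOESS; span is a fraction of the number of points."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    # Window size in points; at least degree + 2 because the farthest point gets weight 0.
    k = min(n, max(int(np.ceil(span * n)), degree + 2))
    smoothed = np.empty(n)
    for i in range(n):
        idx = np.argsort(np.abs(x - x[i]))[:k]        # the k nearest points form the span
        w = tricube_weights(x[i], x[idx])
        # Fit in coordinates centered on x[i]; the constant term is the fit value there.
        coef = np.polyfit(x[idx] - x[i], y[idx], degree, w=np.sqrt(w))
        smoothed[i] = coef[-1]
    return smoothed


if __name__ == "__main__":
    xs = np.linspace(0.0, 10.0, 21)
    ys = 0.5 * xs ** 2 - xs + 2.0
    print(np.max(np.abs(loess(xs, ys) - ys)))  # a quadratic is reproduced by a degree-2 fit
```

Each smoothed value is computed from the raw data only, so the order in which points are visited does not affect the result.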

[0052] For robust LOESS, these results are used to compute the robust weights
based on the residual $r_i$ between the raw data value $y_i$ and polynomial value $\hat{y}_i$ for each
point in the span:

$$r_i = y_i - \hat{y}_i \tag{7}$$

and on the median absolute deviation MAD:

$$\mathrm{MAD} = \mathrm{median}\left( |r_i| \right) \tag{8}$$
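From these, the robust weights are computed:

$$w_i^{\mathrm{robust}} = \begin{cases} \left( 1 - \left( \dfrac{r_i}{6\,\mathrm{MAD}} \right)^2 \right)^2, & |r_i| < 6\,\mathrm{MAD} \\ 0, & |r_i| \ge 6\,\mathrm{MAD} \end{cases} \tag{9}$$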



The regression is performed again for the span (from equation (5)) using newly
computed weights, $w_i = w_i^{\mathrm{initial}} \cdot w_i^{\mathrm{robust}}$, to obtain a new curve fit, a new set of points $\hat{y}_i$,
and new residuals $r_i$. This procedure (computing robust weights and fitting the
polynomial) is repeated until the curve fit converges to a desired precision or for a
predetermined number of iterations, e.g., about 5. Upon convergence, the $y$ value of
the point being smoothed, $x$, is replaced with the curve fit value. Only that point is
replaced; all other points in the span remain the same. The span is then shifted one
point to the right and the entire procedure repeated to smooth the point in the center of
the span. Each time the curve fit is performed, the $y_i$ values used are the raw data
values, not the smoothed ones. End points are treated as is commonly done in
smoothing.
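
The robust iteration can be sketched as shown below, again purely by way of example. The bisquare form of the robust weight and its cutoff at six times the MAD follow Cleveland's standard formulation and should be read as an assumption of this sketch; the function name, the nearest-neighbor span, and the fixed iteration count are likewise illustrative choices.

```python
import numpy as np


def robust_loess(x, y, span=0.3, degree=2, n_iter=5):
    """Robust LOESS sketch: each span is refit n_iter times with weights
    equal to the tricubic (initial) weight times a bisquare robust weight."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    k = min(n, max(int(np.ceil(span * n)), degree + 2))
    smoothed = np.empty(n)
    for i in range(n):
        idx = np.argsort(np.abs(x - x[i]))[:k]   # k nearest points form the span
        xw = x[idx] - x[i]                       # centered abscissa values
        yw = y[idx]                              # always the raw data values
        d = np.abs(xw)
        d_max = d.max()
        w_init = np.ones_like(d) if d_max == 0 else (1.0 - (d / d_max) ** 3) ** 3
        w = w_init.copy()
        for _ in range(n_iter + 1):              # one initial fit plus n_iter refits
            coef = np.polyfit(xw, yw, degree, w=np.sqrt(w))
            resid = yw - np.polyval(coef, xw)    # residuals between raw data and fit
            mad = np.median(np.abs(resid))       # median absolute deviation
            if mad == 0:
                break                            # fit is exact for most points
            u = resid / (6.0 * mad)
            w_robust = np.where(np.abs(u) < 1.0, (1.0 - u ** 2) ** 2, 0.0)
            w = w_init * w_robust                # product of initial and robust weights
        smoothed[i] = coef[-1]                   # fit value at the point being smoothed
    return smoothed


if __name__ == "__main__":
    xs = np.linspace(0.0, 10.0, 41)
    ys = 2.0 * xs + 1.0
    ys[20] += 50.0                               # a single gross outlier at x = 5
    print(robust_loess(xs, ys, span=0.5)[20])
    print(robust_loess(xs, ys, span=0.5, n_iter=0)[20])
```

Setting `n_iter=0` reduces the routine to standard LOESS, which makes the effect of the robust weights on an outlier easy to compare.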

[0053] After all $\hat{y}_i$ values are obtained, a mapping from $t_{\mathrm{ref}}$ to $t_{\mathrm{test}}$ is determined,
and values for intermediate points are computed by interpolation. The retention time
values of mapped test points are then adjusted to align the complete data sets. The
process is repeated for all test data sets. Note that if the goal of the method is to align
corresponding peaks only, it is not necessary to find aligned time point values for the
intermediate points.

[0054] Although not limited to any particular hardware configuration, the present
invention is typically implemented in software by a system containing a computer that
obtains data sets from an analytical instrument (e.g., LC-MS instrument) or other
source. The LC-MS instrument includes a liquid chromatography instrument
connected to a mass spectrometer by an interface. The computer implementing the
invention typically contains a processor, memory, data storage medium, display, and
input device (e.g., keyboard and mouse). Methods of the invention are executed by
the processor under the direction of computer program code stored in the computer.
Using techniques well known in the computer arts, such code is tangibly embodied
within a computer program storage device accessible by the processor, e.g., within
system memory or on a computer-readable storage medium such as a hard disk or
CD-ROM. The methods may be implemented by any means known in the art. For
example, any number of computer programming languages, such as Java™, C++, or
Perl, may be used. Furthermore, various programming approaches such as procedural
or object oriented may be employed. It is to be understood that the steps described
above are highly simplified versions of the actual processing performed by the
computer, and that methods containing additional steps or rearrangement of the steps
described are within the scope of the present invention.
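
One way the mapping and interpolation step might be carried out is sketched below. Linear interpolation with `numpy.interp` is an assumption of this example (the text requires only that intermediate points be interpolated), and the sketch further assumes that the smoothed mapping is monotonic in the test retention time. The function name is hypothetical.

```python
import numpy as np


def align_test_times(t_avg, delta_t, test_times):
    """Map every retention time of the test data set onto the reference time axis.

    t_avg, delta_t: smoothed discrete points (delta_t = t_ref - t_test).
    test_times: retention times of all scans in the test data set.
    """
    t_avg = np.asarray(t_avg, dtype=float)
    delta_t = np.asarray(delta_t, dtype=float)
    t_ref = t_avg + delta_t / 2.0     # invert the (t_avg, delta_t) transform
    t_test = t_avg - delta_t / 2.0
    order = np.argsort(t_test)        # np.interp needs increasing sample points
    # Times outside the range of smoothed peaks are clamped to the end values.
    return np.interp(test_times, t_test[order], t_ref[order])


if __name__ == "__main__":
    aligned = align_test_times([10.5, 20.0], [-1.0, 2.0], [11.0, 15.0, 19.0])
    print(aligned)
```

Only the test time axis is modified; the reference data set is left unchanged, consistent with the description above.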

EXAMPLES

[0055] The following examples are provided solely to illustrate various
embodiments of the present invention and are not intended to limit the scope of the
invention to the disclosed details.

EXAMPLE 1: Peaks aligned by dynamic time warping

[0056] Pooled human serum from blood bank samples was ultrafiltered through a
10-kDa membrane, and the resulting high-molecular weight fraction was reduced with
dithiothreitol (DTT) and carboxymethylated with iodoacetic acid/NaOH before being
digested with trypsin. Digested samples were analyzed on a binary HP 1100 series
HPLC coupled directly to a Micromass (Manchester, UK) LCT™ electrospray
ionization (ESI) time-of-flight (TOF) mass spectrometer equipped with a microspray
source. PicoFrit™ fused-silica capillary columns (5 µm BioBasic C18, 75 µm x 10
cm, New Objective, Woburn, MA) were run at a flow rate of 300 nL/min after flow
splitting. An on-line trapping cartridge (Peptide CapTrap, Michrom Bioresources,
Auburn, CA) allowed fast loading onto the capillary column. Injection volume was
20 µL. Gradient elution was achieved using 100% solvent A (0.1% formic acid in
water) to 40% solvent B (0.1% formic acid in acetonitrile) over 100 min.

[0057] Data sets were aligned by dynamic time warping (DTW) implemented in
MATLAB® (The MathWorks, Cambridge, MA) with custom code.

[0058] FIGS. 8A-8B show a small region of data sets corresponding to four
different samples, before and after alignment of the bottom three data sets (test) to the
top (reference) data set using DTW. Corresponding peaks are indicated. In all cases,
the aligned peaks are much closer (in retention time) to the reference peaks than they
were before alignment.

EXAMPLE 2: Data sets aligned by dynamic time warping and LOESS

[0059] Pooled human serum from blood bank samples was ultrafiltered through a
10-kDa membrane, and the resulting high-molecular weight fraction was reduced with
dithiothreitol (DTT) and carboxymethylated with iodoacetic acid/NaOH before being
digested with trypsin. Digested samples were analyzed on a binary HP 1100 series
HPLC coupled directly to a ThermoFinnigan (San Jose, CA) LCQ DECA™
electrospray ionization (ESI) ion-trap mass spectrometer using automatic gain control.
PicoFrit™ fused-silica capillary columns (5 µm BioBasic C18, 75 µm x 10 cm, New
Objective, Woburn, MA) were run at a flow rate of 300 nL/min after flow splitting.
An on-line trapping cartridge (Peptide CapTrap, Michrom Bioresources, Auburn, CA)
allowed fast loading onto the capillary column. Injection volume was 20 µL.
Gradient elution was achieved using 100% solvent A (0.1% formic acid in water) to
40% solvent B (0.1% formic acid in acetonitrile) over 100 min.

[0060] Spectra were aligned using both dynamic time warping (DTW) and robust
LOESS. Algorithms were implemented in MATLAB® (The MathWorks, Cambridge,
MA). Robust LOESS smoothing was performed using a prepackaged routine in the
MATLAB® Curve Fitting Toolbox. DTW was implemented with custom MATLAB®
code following the algorithms described above.

[0061] FIG. 9 is a plot of transformed data set variables $-\Delta t$ vs. $t_{\mathrm{avg}}$ showing
alignment by robust LOESS and DTW. Inverted triangles represent potentially
corresponding automatically-selected peaks, filled circles are points smoothed by
robust LOESS, and the thin solid line is the data set corrected by DTW. The DTW
points are much more densely spaced, because they are taken from the entire data set,
rather than selected peaks only. In this example, both robust LOESS and DTW
accurately track the time shift, with LOESS following the local variations more



[0062] It should be noted that the foregoing description is only illustrative of the
invention. Various alternatives and modifications can be devised by those skilled in
the art without departing from the invention. Accordingly, the present invention is
intended to embrace all such alternatives, modifications and variances which fall
within the scope of the appended claims.

What is claimed is: 

1. A computer-implemented method for time-aligning at least two
chromatography-mass spectrometry data sets, each comprising a plurality of
mass chromatograms, said method comprising:

a) computing a distance function between said data sets in dependence on at
least two mass chromatograms from each data set; and

b) aligning said data sets by minimizing said distance function to obtain
aligned data sets.

2. The method of claim 1, wherein one of said data sets is a reference data set
and one of said data sets is a test data set, and wherein said test data set is
aligned to said reference data set.

3. The method of claim 1, wherein said data sets are liquid
chromatography-mass spectrometry data sets.

4. The method of claim 1, wherein said distance function is computed in
dependence on between about 200 and about 400 mass chromatograms
from each data set.



5. The method of claim 1, further comprising selecting said at least two mass
chromatograms according to a selection criterion.



6. The method of claim 1, wherein said distance function is computed in
dependence on a chromatogram-dependent weighting factor.

7. The method of claim 6, wherein said chromatogram-dependent
weighting factor is a function of at least one of a peak number, an
intensity threshold, and a signal-to-noise ratio.

8. A plurality of chromatography-mass spectrometry data sets aligned according to
the method of claim 1.

11. The method of claim 10, wherein one of said selected data sets is a
reference data set and another of said selected data sets is a test data set,
and wherein said test data set is aligned to said reference data set.

12. The method of claim 10, wherein said chromatography-mass spectrometry
is liquid chromatography-mass spectrometry.

13. The method of claim 10, further comprising aligning two additional data
sets, wherein at least one of said additional data sets differs from said
selected data sets.

14. The method of claim 10, further comprising selecting said at least two
mass

9. A program storage device accessible by a processor, tangibly embodying a
program of instructions executable by said processor to perform method steps
for a method for time-aligning chromatography-mass spectrometry data sets,
each comprising a plurality of mass chromatograms, said method steps
comprising:

a) computing a distance function between said data sets in dependence on at
least two mass chromatograms from each data set; and

b) aligning said data sets by minimizing said distance function to obtain
aligned data sets.

10. A method for comparing at least two samples, comprising:

a) performing chromatography-mass spectrometry on each sample to obtain
at least two data sets, each comprising a plurality of mass chromatograms;

b) computing a distance function between two selected data sets in
dependence on at least two mass chromatograms from each selected data
set;

c) aligning said selected data sets by minimizing said distance function to
obtain aligned selected data sets; and

15. The method of claim 14, wherein said selection criterion is a user-
provided selection criterion.

16. The method of claim 14, wherein said selection criterion comprises
an intensity threshold.

17. The method of claim 14, wherein said selection criterion comprises a
number of chromatograms.

18. The method of claim 14, wherein said selection criterion comprises
an orthogonality metric.

19. The method of claim 14, wherein said selection criterion comprises a
retention time range.

20. The method of claim 10, wherein said distance function is computed in
dependence on between about 200 and about 400 mass chromatograms.

21. The method of claim 10, wherein said distance function is computed in
dependence on between about 200 and about 300 mass chromatograms.

22. The method of claim 10, wherein said distance function is computed in
dependence on about 200 mass chromatograms.

23. The method of claim 10, wherein said distance function is computed in
dependence on a weighting factor.

24. The method of claim 23, wherein said weighting factor is a
chromatogram-dependent weighting factor.

25. The method of claim 24, wherein said chromatogram-
dependent weighting factor is a function of at least one of a
peak number, an intensity threshold, and a signal-to-noise ratio.

26. The method of claim 10, further comprising identifying features that
differentiate said aligned selected data sets.

27. A plurality of samples compared according to the method of claim 10.

28. A method for identifying a biomarker differentiating two cohorts, comprising:

a) comparing at least two samples according to the method of claim 10, at
least one each of said samples representing a different one of said two
cohorts; and

b) identifying a biomarker in dependence on said comparison.

29. A biomarker identified by the method of claim 28.

30. A diagnostic method comprising detecting a biomarker identified by the method
of claim 28.

31. A computer-implemented method for time-aligning at least two two-
dimensional chromatography-mass spectrometry data sets, comprising:

a) selecting peaks in said data sets;

b) identifying potentially corresponding peaks from said selected peaks; and

c) performing a locally-weighted regression smoothing on said potentially
corresponding peaks to obtain aligned data sets.

32. The method of claim 31, wherein one of said data sets is a reference data
set and one of said data sets is a test data set, and wherein said test data set
is aligned to said reference data set.

33. The method of claim 31, wherein said data sets are liquid chromatography-
mass spectrometry data sets.

34. The method of claim 31, wherein said locally-weighted regression
smoothing is a robust locally-weighted regression smoothing.

35. The method of claim 34, wherein said robust locally-weighted
regression smoothing comprises robust LOESS.

36. The method of claim 31, wherein said peaks are selected automatically.

37. The method of claim 31, wherein said locally-weighted regression
smoothing is performed in dependence on a span.

38. A plurality of chromatography-mass spectrometry data sets aligned according to
the method of claim 31.




40. A method for comparing at least two samples, comprising:

a) performing chromatography-mass spectrometry on each sample to obtain
at least two two-dimensional data sets;

b) selecting peaks in two selected data sets;

c) identifying potentially corresponding peaks from said selected peaks;

d) performing a locally-weighted regression smoothing on said potentially
corresponding peaks to obtain aligned selected data sets; and

e)

41. The method of claim 40, wherein one of said selected data sets is a
reference data set and another of said selected data sets is a test data set,


39. A program storage device accessible by a processor, tangibly embodying a
program of instructions executable by said processor to perform method steps
for a method for time-aligning two-dimensional chromatography-mass
spectrometry data sets, said method steps comprising:

a) selecting peaks in said data sets;

b) identifying potentially corresponding peaks from said selected peaks; and

c) performing a locally-weighted regression smoothing on said potentially

42. The method of claim 40, wherein said chromatography-mass spectrometry
is liquid chromatography-mass spectrometry.

43. The method of claim 40, further comprising aligning two additional data
sets, wherein at least one of said additional data sets differs from said
selected data sets.

44. The method of claim 40, further comprising identifying features that



45. A plurality of samples compared according to the method of claim 40.

46. A method for identifying a biomarker differentiating two cohorts, comprising:

a) comparing at least two samples according to the method of claim 40, at
least one each of said samples representing a different one of said two
cohorts; and

b) identifying a biomarker in dependence on said comparison.

47. A biomarker identified by the method of claim 40.